The cellular composition of complex tissues is often prerequisite knowledge for further inquiry of the function of the tissues of interest. The advent of single-cell RNA-sequencing (scRNA-seq) technology and the accompanying bioinformatics tools to analyze the resulting datasets has been a boon to biologists. In the field of behavioral neuroscience, scRNA-seq has been widely used to elucidate the neuron subtypes within a population of neurons implicated in a stereotyped behavior. Chung et al. have secured a scRNA-seq dataset from a population of neurons known to induce sleep in mice. While they have explored broad trends in their dataset and reported two general markers for that entire neuron population, they have not fully taken advantage of the power of scRNA-seq datasets to identify subtypes of neurons within the whole population. Using principal component analysis and unsupervised clustering, I have further subdivided that population into three clusters based on gene expression differences. I then defined markers for each cluster using differential gene expression analysis. While markers defined in silico may not correspond to the actual biology of the cells of interest, some of the markers I defined were corroborated by in vivo data from the Allen Brain Institute.
Chung et al. are interested in a specific population of neurons in the ventrolateral preoptic area (VLPO) and the median preoptic area (MnPO) of the hypothalamus that controls sleep in mice. They found that these neurons are active during sleep, and activating these neurons induces sleep. While the general location of these neurons is known to be the preoptic area (POA) of the hypothalamus, it is not clear which genetic markers distinguish them from other types of neurons in the POA, which also contains wake-active neurons. Knowing these markers is crucial for accurate genetic targeting and circuit analysis of these sleep-inducing neurons. Chung et al. used single-cell RNA-seq on a purified population of sleep-inducing neurons from the POA. They found that Tac1 and Pdyn are two strong markers for this population of neurons. However, they did not analyze their scRNA-seq dataset further to ask whether there might be subtypes of neurons that subdivide this population. Because sleep is a finely regulated state with distinct stages (NREM and REM), it likely involves more than a single type of neurons. Therefore, it is important to ask whether there are subtypes of sleep-inducing neurons that might coordinate different stages of sleep, and what the subtype-specific markers are.
Single-cell RNA-seq is an approach to unntangle the cell-type heterogeneity of complex tissues or cell populations. It begins with dissociating the tissue or population of interest into single cells. After lysing each cell, each cell’s RNA content is extracted and reverse-transcribed into cDNA. The cDNA is amplified into a sequencing library, from which small samples are submitted to a next-generation sequencer. The sequencing results are returned as FASTQ files, which need to be preprocessed into a counts matrix before any meaningful analysis can be done. To identify potential cell types from the counts matrix, principal component analysis (PCA) and unsupervised classification are used to define clusters that can potentially correspond to distinct cell types. Marker genes for each cluster can then be computed via differential gene expression analysis. To formally confirm that these markers actually label distinct cell types, they need to be visualized in vivo in the tissue or cell population it came from.
The dataset I am working with is a scRNA-seq dataset. It contains counts from 84 sleep-inducing neurons from the mouse POA. Each neuron was sorted into a 96-well plate through a fluorescence-activated cell-sorter. They used the SMARTer Ultra Low RNA Kit for Illumina Sequencing to reverse-transcribe the RNA, amplify the cDNA, and construct the 84 cDNA libraries. The libraries were sequenced on the Illuminia HiSeq 2000 or 2500.
A scRNA-seq dataseq differs from a microarray dataset or a bulk RNA-seq dataset. A microarray dataset will only contain information about the transcripts for which the microarray has specified probes for; it is a biased capture of RNA content from the cells of interest. RNA-seq, both bulk and single-cell, is able to capture RNA without that bias. Bulk RNA-seq provides gene expression information on the entire cell population that was sequenced. Single-cell RNA-seq goes a step further; it provides gene expression information on each individual cell that was sequenced. As a result of each method’s technical design, microarray datasets are typically smaller than bulk RNA-seq datasets, which are typically smaller than scRNA-seq datasets.
I performed the following, standard preprocessing steps on this scRNA-seq dataset from the command line:
1. Download the raw data in the form of FASTQ files from SRA.
Module: sratoolkit
2. Perform quality control of FASTQ files.
Module: fastqc
3. Align reads to a mouse reference genome (Mus_musculus.GRCm38).
Module: STAR
4. Manipulate alignments in BAM files to allow read retrieval.
Module: samtools
5. Generate a counts matrix from the BAM files.
Module: subread
In R, I further prepared the counts matrix so that it is ready for analysis:
1. Load the counts matrix and perform cell-level and gene-level quality control.
Package: SingleCellExperiment, scater
2. Convert the SingleCellExperiment object to a Seurat object.
Package: Seurat
To analyze the dataset, I used the Seurat package in R:
1. Normalize data.
2. Compute principal components (PCs) and select some for clustering.
3. Used unsupervised classification to define clusters.
4. Visualize the distribution of the clusters in a tSNE plot.
5. Perform differential gene expression analysis to identify marker genes for each cluster.
6. Visualize the distribution of individual markers in tSNE plots.
The dataset can be accessed at https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE79108.
# after the preprocessing steps and getting the dataset as a Seurat object, normalize the data
sn = NormalizeData(sn, normalization.method = "LogNormalize")
# find the most highly variable genes to use for computing PCs
sn = FindVariableGenes(sn, y.cutoff = 1, x.low.cutoff = 0.0125, x.high.cutoff = 2)
# compute PCs
sn = ScaleData(sn, genes.use = sn@var.genes , vars.to.regress = c("nUMI"), use.umi = TRUE, model.use = 'negbinom', block.size = 5000)
sn = RunPCA(sn, pcs.compute = 30, maxit = 250, verbose = T)
# select the PCs that best account for the variation in the data, note the PCs with significance under 0.05
sn = JackStraw(sn, num.pc = 30)
JackStrawPlot(sn, PCs = 1:30)
# use the selected PCs to define clusters with an unsupervised classifier
sn = FindClusters(sn, dims.use = sn_dims, k.param = 10, save.SNN = T, n.start = 10, algorithm = 3, resolution = 1, print.output = F)
# visualize clusters on a tSNE plot
sn = RunTSNE(sn, reduction.use = "pca", dims.use = sn_dims, eta = 200, max_iter = 400, perplexity = 10, theta = 0.1, verbose = TRUE)
TSNEPlot(sn, do.label = T, colors.use = scales::hue_pal()(length(levels(sn@ident))))
# find differentially expressed genes for each cluster
sn.markers = FindAllMarkers(sn, logfc.threshold = 0.5, test.use = 'MAST', only.pos = T, min.diff.pct = 0.1, min.pct = 0.1, latent.vars = c('nUMI'), do.print = TRUE)
I identified 20 markers for Cluster 1, 91 markers for Cluster 2, 61 markers for Cluster 3, and 47 markers for Cluster 4. The top 10 markers for each cluster is shown below. The complete list of markers can be found in the supplement.
head(filter(sn.markers, cluster == 1),10)
head(filter(sn.markers, cluster == 2),10)
head(filter(sn.markers, cluster == 3),10)
head(filter(sn.markers, cluster == 4),10)
After visualizing the markers for each cluster on tSNE plots, it is clear that some of these markers are not exclusive to only one cluster. They were removed from further consideration as subtype markers. It also became clear that Clusters 1 and 2 share a larger number of markers, while Clusters 3 and 4 each has its own set of markers. It may be that the heterogeneity of these sleep-inducing neurons is really better described when Cluster 1 and 2 are considered as one cluster.
The following are markers that are 1) highly restricted to one of the three groups (Cluster 1/2, Cluster 3, and Cluster 4) and 2) confirmed to be in a specific region of the POA by in situ hybridization data from the Allen Brain Institute.
Clusters 1/2: C1ql2, Col19a1
C1ql2
Col19a1
Cluster 3: Gpx3, Ngb
Gpx3
Ngb
Cluster 4: Chat, Cd44, Slc5a7
Chat
Cd44
Slc5a7
I have reanalyzed Chung et al.’s scRNA-seq dataset mainly with the use of the the Seurat package in R. My analysis showed that these neurons could be subdivided into at least three groups according to their gene expression patterns. The largest group, as previously described by Chung et al., consists of neurons expressing Tac1 and Pdyn. These two markers capture the majority of the sleep-inducing neurons in this study. My further analysis, however, revealed that at least two other groups of neurons, distinct from the Tac1/Pdyn neurons, can be defined from the same dataset. One of these groups is marked by Gpx3/Ngb (Cluster 3), and another group is marked by Chat/Cd44/Slc5a7 (Cluster 4). Futhermore, the Tac1/Pdyn population has a subset of neurons that is marked by Col19a1/C1ql2 (Cluster 1/2). These results suggest that within the population of sleep-inducing neurons in the POA, genetically distinct subpopulations exist. Having the markers for each of these subpopulations paves the way to validation steps in vivo to show whether these genetically distinct cells are also anatomically distinct in the POA and functionally distinct in the fine regulation of sleep. Initial data from the Allen Brain Institute already suggest that these are anatomically distinct populations.
In addition to cell type identification in heterogeneous tissues, scRNA-seq can serve many other purposes. It can be used to identify gene expression differences bewteen cells in different states. These might be cells in different stages of the cell cycle, cells in different stages of normal development, cells that are activated versus cells that are dormant, etc. The range of biological questions that can be addressed with scRNA-seq is incredibly vast. As such, it is unlikely that scRNA-seq will go out of style anytime soon.
library(SingleCellExperiment)
library(scater)
library(Seurat)
library(MAST)
library(dplyr)
setwd("/Users/deanlee/Desktop/sleep_neurons")
# there is a header, the first column is our row names
data = read.table('counts.matrix.txt', header=T, row.names=1)
meta = data.frame(matrix(NA, nrow=ncol(data), ncol=2))
colnames(meta) = c('marker', 'position')
rownames(meta) = colnames(data)
meta[77:84,'marker'] = 'chat'
meta[77:84,'position'] = 'posterior'
meta[25:76,'marker'] = 'vgat'
meta[25:76,'position'] = 'anterior'
meta[14:24,'marker'] = 'gad'
meta[14:24,'position'] = 'anterior'
meta[1:13,'marker'] = 'gad'
meta[1:13,'position'] = 'posterior'
sce = SingleCellExperiment(assays = list(counts = as.matrix(data)),
colData = meta)
# remove genes that are not expressed in any cell
genes_to_keep = rowSums(counts(sce) > 0) > 0
sce = sce[genes_to_keep,]
# define control sequences, such as ERCC spike-ins
isSpike(sce, "ERCC") = grepl("^ERCC-", rownames(sce))
# get quality metrics, such as total_counts, total_features, etc.
sce = calculateQCMetrics(sce,
feature_controls = list(ERCC = isSpike(sce, "ERCC")))
# see number of reads per cell
hist(sce$total_counts, breaks = 100)
abline(v = 500000, col = 'red')
# remove cells with low number of reads (<500000)
filter_by_total_counts = sce$total_counts > 500000
# see genes detected per cell
hist(sce$total_features, breaks = 100)
abline(v = 5000, col = 'red')
# remove cells with low number of genes detected
filter_by_total_features = sce$total_features > 5000
# see ratio between ERCCs and endogenous RNAs
plotPhenoData(sce,
aes_string(x = "total_features", y = "pct_counts_ERCC")
)
# remove cells with greater than 10% of RNAs being ERCCs
filter_by_ERCC = sce$pct_counts_ERCC < 10
# filter cells manually based on the QC visualizations above
sce$use = (
# sufficient counts
filter_by_total_counts &
# sufficient features (genes)
filter_by_total_features &
# sufficient endogenous RNA
filter_by_ERCC
)
table(sce$use) # filtered out 7 cells
FALSE TRUE
7 77
plotQC(sce, type = "highest-expression")
# the relatively flat distribution suggests that there's good coverage of the transcriptome
# to keep a gene, require at least five reads in at least two cells
filter_genes = apply(counts(sce[,colData(sce)$use]), 1, function(x) length(x[x>5]) >= 2)
rowData(sce)$use = filter_genes
table(filter_genes) # 9102 genes filtered out, 18530 genes remain
filter_genes
FALSE TRUE
9102 18530
dim(sce[rowData(sce)$use, colData(sce)$use]) # after QC, 77 cells and 18530 genes left
[1] 18530 77
# generate log2-transformed counts, save it into a new slot
assay(sce, "logcounts_raw") = log2(counts(sce) + 1)
# remove saved PCA results from the reducedDim slot
reducedDim(sce) = NULL
# downsample
sce.qc <- sce[rowData(sce)$use, colData(sce)$use]
Down_Sample_Matrix = function (expr_mat) {
min_lib_size <- min(colSums(expr_mat))
down_sample <- function(x) {
prob <- min_lib_size/sum(x)
return(unlist(lapply(x, function(y) {
rbinom(1, y, prob)
})))
}
down_sampled_mat <- apply(expr_mat, 2, down_sample)
return(down_sampled_mat)
}
logcounts(sce.qc) <- log2(Down_Sample_Matrix(counts(sce.qc)) + 1)
# convert SingleCellExperiment object to Seurat object
sn = Convert(sce.qc, 'seurat')
# check the distribution of reads per cell pre normalization
hist(colSums(as.matrix(sn@raw.data[,colnames(sn@data)])),
breaks = 100,
main = "total expression, pre-normalization",
xlab = "# of reads per cell")
# log-normalize data
sn = NormalizeData(sn, normalization.method = "LogNormalize", scale.factor = 10000)
Performing log-normalization
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
# check the distribution of UMIs per cell post normalization
hist(colSums(as.matrix(sn@data)),
breaks = 100,
main = "total expression, post-normalization",
xlab = "# of reads per cell")
# check distribution of genes, UMIs/reads in each cell for outliers
VlnPlot(sn, c("nGene", "nUMI"), nCol = 2, size.title.use = 12, point.size.use = 0.005)
# plot difference measurements against nUMI/reads
GenePlot(sn, gene1 = "nUMI", gene2 = "nGene", cex.use = 0.5)
# find variable genes to use to compute PCs
sn = FindVariableGenes(sn, y.cutoff = 1, x.low.cutoff = 0.0125, x.high.cutoff = 2)
Calculating gene means
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Calculating gene variance to mean ratios
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
length(sn@var.genes)
[1] 2048
# compute PCs
sn = ScaleData(sn, genes.use = sn@var.genes , vars.to.regress = c("nUMI"), use.umi = TRUE, model.use = 'negbinom', block.size = 5000, display.progress = F)
The following genes failed with glm.nb, and fell back to scale(log(y+1))
ENSMUSG00000106341, ENSMUSG00000046178, ENSMUSG00000106775, ENSMUSG00000063594, ENSMUSG00000030606, ENSMUSG00000101360, ENSMUSG00000022090, ENSMUSG00000032028, ENSMUSG00000111605, ENSMUSG00000097861
sn = RunPCA(sn, pcs.compute = 30, maxit = 250, verbose = T)
Working dimension size 37
Initializing starting vector v with samples from standard normal distribution.
Use `set.seed` first for reproducibility.
irlba: using fast C implementation
[1] "PC1"
[1] "ENSMUSG00000066705" "ENSMUSG00000043004" "ENSMUSG00000029608" "ENSMUSG00000076441"
[5] "ENSMUSG00000028926" "ENSMUSG00000026247" "ENSMUSG00000071379" "ENSMUSG00000039943"
[9] "ENSMUSG00000029219" "ENSMUSG00000023945" "ENSMUSG00000021919" "ENSMUSG00000028194"
[13] "ENSMUSG00000026163" "ENSMUSG00000100241" "ENSMUSG00000004085" "ENSMUSG00000042761"
[17] "ENSMUSG00000028546" "ENSMUSG00000003411" "ENSMUSG00000021273" "ENSMUSG00000096225"
[21] "ENSMUSG00000003559" "ENSMUSG00000005087" "ENSMUSG00000093738" "ENSMUSG00000003528"
[25] "ENSMUSG00000114934" "ENSMUSG00000035681" "ENSMUSG00000020230" "ENSMUSG00000029361"
[29] "ENSMUSG00000031604" "ENSMUSG00000023367"
[1] ""
[1] "ENSMUSG00000071984" "ENSMUSG00000026834" "ENSMUSG00000021379" "ENSMUSG00000031765"
[5] "ENSMUSG00000063919" "ENSMUSG00000027276" "ENSMUSG00000008540" "ENSMUSG00000026185"
[9] "ENSMUSG00000045038" "ENSMUSG00000036617" "ENSMUSG00000033060" "ENSMUSG00000020015"
[13] "ENSMUSG00000040446" "ENSMUSG00000025867" "ENSMUSG00000034109" "ENSMUSG00000025776"
[17] "ENSMUSG00000026634" "ENSMUSG00000029287" "ENSMUSG00000055489" "ENSMUSG00000053477"
[21] "ENSMUSG00000071064" "ENSMUSG00000024897" "ENSMUSG00000089872" "ENSMUSG00000062373"
[25] "ENSMUSG00000057551" "ENSMUSG00000018217" "ENSMUSG00000053025" "ENSMUSG00000036813"
[29] "ENSMUSG00000104435" "ENSMUSG00000087604"
[1] ""
[1] ""
[1] "PC2"
[1] "ENSMUSG00000038370" "ENSMUSG00000079056" "ENSMUSG00000027400" "ENSMUSG00000048721"
[5] "ENSMUSG00000097823" "ENSMUSG00000010044" "ENSMUSG00000037474" "ENSMUSG00000018774"
[9] "ENSMUSG00000019832" "ENSMUSG00000026565" "ENSMUSG00000018417" "ENSMUSG00000097340"
[13] "ENSMUSG00000046027" "ENSMUSG00000037594" "ENSMUSG00000066026" "ENSMUSG00000105942"
[17] "ENSMUSG00000039954" "ENSMUSG00000056296" "ENSMUSG00000027210" "ENSMUSG00000032224"
[21] "ENSMUSG00000021478" "ENSMUSG00000027270" "ENSMUSG00000104116" "ENSMUSG00000018809"
[25] "ENSMUSG00000038302" "ENSMUSG00000049939" "ENSMUSG00000105974" "ENSMUSG00000106025"
[29] "ENSMUSG00000029054" "ENSMUSG00000053889"
[1] ""
[1] "ENSMUSG00000022090" "ENSMUSG00000108551" "ENSMUSG00000020063" "ENSMUSG00000101360"
[5] "ENSMUSG00000057895" "ENSMUSG00000046240" "ENSMUSG00000040387" "ENSMUSG00000032271"
[9] "ENSMUSG00000042045" "ENSMUSG00000046178" "ENSMUSG00000019768" "ENSMUSG00000108953"
[13] "ENSMUSG00000004558" "ENSMUSG00000034488" "ENSMUSG00000020802" "ENSMUSG00000017009"
[17] "ENSMUSG00000025429" "ENSMUSG00000069814" "ENSMUSG00000034825" "ENSMUSG00000027962"
[21] "ENSMUSG00000063594" "ENSMUSG00000017390" "ENSMUSG00000027716" "ENSMUSG00000020098"
[25] "ENSMUSG00000030605" "ENSMUSG00000029810" "ENSMUSG00000068762" "ENSMUSG00000023367"
[29] "ENSMUSG00000023913" "ENSMUSG00000041959"
[1] ""
[1] ""
[1] "PC3"
[1] "ENSMUSG00000108551" "ENSMUSG00000039706" "ENSMUSG00000006356" "ENSMUSG00000056486"
[5] "ENSMUSG00000036295" "ENSMUSG00000033730" "ENSMUSG00000035131" "ENSMUSG00000040703"
[9] "ENSMUSG00000064181" "ENSMUSG00000036306" "ENSMUSG00000066189" "ENSMUSG00000055078"
[13] "ENSMUSG00000035270" "ENSMUSG00000051427" "ENSMUSG00000029755" "ENSMUSG00000003992"
[17] "ENSMUSG00000059173" "ENSMUSG00000061718" "ENSMUSG00000047261" "ENSMUSG00000029088"
[21] "ENSMUSG00000025372" "ENSMUSG00000058897" "ENSMUSG00000109125" "ENSMUSG00000029267"
[25] "ENSMUSG00000050074" "ENSMUSG00000021478" "ENSMUSG00000078970" "ENSMUSG00000023072"
[29] "ENSMUSG00000029291" "ENSMUSG00000005268"
[1] ""
[1] "ENSMUSG00000104435" "ENSMUSG00000021814" "ENSMUSG00000027276" "ENSMUSG00000026603"
[5] "ENSMUSG00000026185" "ENSMUSG00000045414" "ENSMUSG00000079157" "ENSMUSG00000046275"
[9] "ENSMUSG00000026836" "ENSMUSG00000058013" "ENSMUSG00000036907" "ENSMUSG00000071064"
[13] "ENSMUSG00000008540" "ENSMUSG00000036617" "ENSMUSG00000034324" "ENSMUSG00000033508"
[17] "ENSMUSG00000028630" "ENSMUSG00000026344" "ENSMUSG00000034009" "ENSMUSG00000018217"
[21] "ENSMUSG00000026156" "ENSMUSG00000029287" "ENSMUSG00000004558" "ENSMUSG00000008489"
[25] "ENSMUSG00000026141" "ENSMUSG00000022462" "ENSMUSG00000109286" "ENSMUSG00000036813"
[29] "ENSMUSG00000112287" "ENSMUSG00000033060"
[1] ""
[1] ""
[1] "PC4"
[1] "ENSMUSG00000027523" "ENSMUSG00000000214" "ENSMUSG00000020053" "ENSMUSG00000104116"
[5] "ENSMUSG00000030223" "ENSMUSG00000062151" "ENSMUSG00000048915" "ENSMUSG00000097823"
[9] "ENSMUSG00000030092" "ENSMUSG00000112302" "ENSMUSG00000019832" "ENSMUSG00000025255"
[13] "ENSMUSG00000060843" "ENSMUSG00000015396" "ERCC-00116" "ENSMUSG00000024383"
[17] "ENSMUSG00000110010" "ENSMUSG00000055653" "ENSMUSG00000079056" "ENSMUSG00000030545"
[21] "ENSMUSG00000059921" "ENSMUSG00000022324" "ENSMUSG00000054667" "ENSMUSG00000104279"
[25] "ENSMUSG00000102508" "ENSMUSG00000106025" "ENSMUSG00000095620" "ENSMUSG00000042793"
[29] "ENSMUSG00000046593" "ENSMUSG00000024867"
[1] ""
[1] "ENSMUSG00000021596" "ENSMUSG00000028289" "ENSMUSG00000021057" "ENSMUSG00000027669"
[5] "ENSMUSG00000048240" "ENSMUSG00000051726" "ENSMUSG00000024524" "ENSMUSG00000032503"
[9] "ENSMUSG00000039629" "ENSMUSG00000061718" "ENSMUSG00000036306" "ENSMUSG00000042350"
[13] "ENSMUSG00000004961" "ENSMUSG00000031342" "ENSMUSG00000069662" "ENSMUSG00000032648"
[17] "ENSMUSG00000052560" "ENSMUSG00000059173" "ENSMUSG00000038128" "ENSMUSG00000064065"
[21] "ENSMUSG00000033910" "ENSMUSG00000030067" "ENSMUSG00000056486" "ENSMUSG00000019936"
[25] "ENSMUSG00000031077" "ENSMUSG00000015568" "ENSMUSG00000006356" "ENSMUSG00000052374"
[29] "ENSMUSG00000042816" "ENSMUSG00000028232"
[1] ""
[1] ""
[1] "PC5"
[1] "ENSMUSG00000019936" "ENSMUSG00000019943" "ENSMUSG00000023868" "ENSMUSG00000033569"
[5] "ENSMUSG00000074776" "ENSMUSG00000003992" "ENSMUSG00000034157" "ENSMUSG00000029755"
[9] "ENSMUSG00000020656" "ENSMUSG00000006205" "ENSMUSG00000061762" "ENSMUSG00000035805"
[13] "ENSMUSG00000021478" "ENSMUSG00000041773" "ENSMUSG00000020607" "ENSMUSG00000006235"
[17] "ENSMUSG00000061718" "ENSMUSG00000050953" "ENSMUSG00000022708" "ENSMUSG00000038128"
[21] "ENSMUSG00000000157" "ENSMUSG00000102995" "ENSMUSG00000050473" "ENSMUSG00000045180"
[25] "ENSMUSG00000040138" "ENSMUSG00000038530" "ENSMUSG00000029309" "ENSMUSG00000025531"
[29] "ENSMUSG00000074736" "ENSMUSG00000026259"
[1] ""
[1] "ENSMUSG00000105003" "ENSMUSG00000106607" "ENSMUSG00000036913" "ENSMUSG00000050730"
[5] "ENSMUSG00000048478" "ENSMUSG00000104432" "ENSMUSG00000081471" "ENSMUSG00000085184"
[9] "ENSMUSG00000111752" "ENSMUSG00000031389" "ENSMUSG00000032648" "ENSMUSG00000104753"
[13] "ENSMUSG00000042816" "ENSMUSG00000102940" "ENSMUSG00000038010" "ENSMUSG00000060985"
[17] "ENSMUSG00000058331" "ENSMUSG00000010307" "ENSMUSG00000034930" "ENSMUSG00000111443"
[21] "ENSMUSG00000020876" "ENSMUSG00000006463" "ENSMUSG00000007659" "ENSMUSG00000050944"
[25] "ENSMUSG00000037017" "ENSMUSG00000004056" "ENSMUSG00000026269" "ENSMUSG00000032537"
[29] "ENSMUSG00000093606" "ENSMUSG00000099094"
[1] ""
[1] ""
PCElbowPlot(sn, num.pc = 30)
# perform unsupervised clustering
sn = FindClusters(sn, dims.use = sn_dims, k.param = 10, save.SNN = T, n.start = 10, algorithm = 3, resolution = 1, print.output = F)
sn = BuildClusterTree(sn, pcs.use = sn_dims, do.reorder = T, reorder.numeric = T)
Finished averaging RNA for cluster 0
Finished averaging RNA for cluster 1
Finished averaging RNA for cluster 2
Finished averaging RNA for cluster 3
Finished averaging RNA for cluster 1
Finished averaging RNA for cluster 2
Finished averaging RNA for cluster 3
Finished averaging RNA for cluster 4
# visualize clusters on tSNE plots
sn = RunTSNE(sn, reduction.use = "pca", dims.use = sn_dims, eta = 200, max_iter = 400, perplexity = 10, theta = 0.1, verbose = TRUE)
Read the 77 x 5 data matrix successfully!
Using no_dims = 2, perplexity = 10.000000, and theta = 0.100000
Computing input similarities...
Normalizing input...
Building tree...
- point 0 of 77
Done in 0.01 seconds (sparsity = 0.498229)!
Learning embedding...
Iteration 50: error is 58.848189 (50 iterations in 0.04 seconds)
Iteration 100: error is 60.817292 (50 iterations in 0.04 seconds)
Iteration 150: error is 60.439441 (50 iterations in 0.04 seconds)
Iteration 200: error is 59.507763 (50 iterations in 0.04 seconds)
Iteration 250: error is 60.151443 (50 iterations in 0.04 seconds)
Iteration 300: error is 2.037635 (50 iterations in 0.04 seconds)
Iteration 350: error is 1.337638 (50 iterations in 0.03 seconds)
Iteration 400: error is 0.725091 (50 iterations in 0.04 seconds)
Fitting performed in 0.32 seconds.
cols = scales::hue_pal()(length(levels(sn@ident)))
TSNEPlot(sn, do.label = T, colors.use = cols)
symbols = getBM(attributes = c('ensembl_gene_id', 'external_gene_name'),
filters = 'ensembl_gene_id', values = rownames(data) , mart = mart)
filter(sn.markers, cluster == 1)
filter(sn.markers, cluster == 2)
filter(sn.markers, cluster == 3)
filter(sn.markers, cluster == 4)
Clusters 1 and 2 seem to share some markers, while Clusters 3 and 4 can each be identified by some markers. It may be that the heterogeneity of these sleep-inducing neurons is really better described by three rather than four subgroups. The following are the markers for the three subgroups (Cluster 1/2, Cluster 3, and Cluster 4) that are also confirmed to be in the POA by in situ hybridization data from the Allen Brain Institute.
Clusters 1/2: Col19a1, C1ql2
Clusters 3: Gpx3, Ngb
Cluster 4: Chat, Cd44, Slc5a7
Chung et al. already reported that this population of sleep-inducing neurons is generally marked by Tac1 and Pdyn. The following plot shows that the Tac1/Pdyn neurons are largely within Cluster 1/2 and Cluster 4.
# Tac1, Pdyn
FeaturePlot(sn, features.plot = c("ENSMUSG00000061762","ENSMUSG00000027400"), cols.use = c("grey", "red", "blue"), reduction.use = "tsne", max.cutoff = 0.5, overlay = T)
Col19a1 and C1ql2 are two largely overlapping markers of Cluster 1/2. They delineate a clear subset of the Tac1/Pdyn neurons.
# Col19a1, C1ql2
FeaturePlot(sn, features.plot = c("ENSMUSG00000026141","ENSMUSG00000036907"), cols.use = c("grey", "red", "blue"), reduction.use = "tsne", max.cutoff = 0.5, overlay = T)
A subset of the neurons in the dataset actually does not express Tac1 or Pdyn, but expresses Chat instead. These Chat-expressing neurons are all part of Cluster 3.
# Tac1, Chat
FeaturePlot(sn, features.plot = c("ENSMUSG00000061762","ENSMUSG00000021919"), cols.use = c("grey", "red", "blue"), reduction.use = "tsne", max.cutoff = 0.5, overlay = T)
# Pdyn, Chat
FeaturePlot(sn, features.plot = c("ENSMUSG00000027400","ENSMUSG00000021919"), cols.use = c("grey", "red", "blue"), reduction.use = "tsne", max.cutoff = 0.5, overlay = T)
The Chat neurons are additionally marked by Slc5a7 and Cd44.
# Slc5a7, Chat
FeaturePlot(sn, features.plot = c("ENSMUSG00000023945","ENSMUSG00000021919"), cols.use = c("grey", "red", "blue"), reduction.use = "tsne", max.cutoff = 0.5, overlay = T)
# Cd44, Chat
FeaturePlot(sn, features.plot = c("ENSMUSG00000005087","ENSMUSG00000021919"), cols.use = c("grey", "red", "blue"), reduction.use = "tsne", max.cutoff = 0.5, overlay = T)
Gpx3 marks another population that is fairly distinct from both the Chat and Tac1/Pdyn populations.
# Gpx3, Chat
FeaturePlot(sn, features.plot = c("ENSMUSG00000018339","ENSMUSG00000021919"), cols.use = c("grey", "red", "blue"), reduction.use = "tsne", max.cutoff = 0.5, overlay = T)
# Gpx3, Tac1
FeaturePlot(sn, features.plot = c("ENSMUSG00000018339","ENSMUSG00000061762"), cols.use = c("grey", "red", "blue"), reduction.use = "tsne", max.cutoff = 0.5, overlay = T)
# Gpx3, Pdyn
FeaturePlot(sn, features.plot = c("ENSMUSG00000018339","ENSMUSG00000027400"), cols.use = c("grey", "red", "blue"), reduction.use = "tsne", max.cutoff = 0.5, overlay = T)
The Gpx3 popluation is also marked by Ngb.
# Gpx3, Ngb
FeaturePlot(sn, features.plot = c("ENSMUSG00000018339","ENSMUSG00000021032"), cols.use = c("grey", "red", "blue"), reduction.use = "tsne", max.cutoff = 0.5, overlay = T)
sessionInfo()
R version 3.4.4 (2018-03-15)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.5
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] biomaRt_2.34.2 bindrcpp_0.2.2 dplyr_0.7.8
[4] MAST_1.4.1 Seurat_2.3.4 Matrix_1.2-14
[7] cowplot_0.9.3 scater_1.6.3 ggplot2_3.1.0
[10] SingleCellExperiment_1.0.0 SummarizedExperiment_1.8.1 DelayedArray_0.4.1
[13] matrixStats_0.53.1 Biobase_2.38.0 GenomicRanges_1.30.3
[16] GenomeInfoDb_1.14.0 IRanges_2.12.0 S4Vectors_0.16.0
[19] BiocGenerics_0.24.0
loaded via a namespace (and not attached):
[1] snow_0.4-2 backports_1.1.2 Hmisc_4.1-1 igraph_1.2.2
[5] plyr_1.8.4 lazyeval_0.2.1 shinydashboard_0.7.1 splines_3.4.4
[9] digest_0.6.15 foreach_1.4.4 htmltools_0.3.6 viridis_0.5.1
[13] lars_1.2 gdata_2.18.0 magrittr_1.5 checkmate_1.8.5
[17] memoise_1.1.0 cluster_2.0.7-1 mixtools_1.1.0 ROCR_1.0-7
[21] limma_3.34.9 R.utils_2.7.0 prettyunits_1.0.2 colorspace_1.3-2
[25] blob_1.1.1 jsonlite_1.5 crayon_1.3.4 RCurl_1.95-4.10
[29] tximport_1.6.0 bindr_0.1.1 zoo_1.8-4 ape_5.2
[33] survival_2.41-3 iterators_1.0.10 glue_1.3.0 gtable_0.2.0
[37] zlibbioc_1.24.0 XVector_0.18.0 kernlab_0.9-27 prabclus_2.2-6
[41] DEoptimR_1.0-8 abind_1.4-5 scales_0.5.0 mvtnorm_1.0-8
[45] DBI_0.8 edgeR_3.20.9 bibtex_0.4.2 Rcpp_1.0.0
[49] dtw_1.20-1 metap_1.0 viridisLite_0.3.0 xtable_1.8-2
[53] progress_1.2.0 htmlTable_1.11.2 reticulate_1.10 proxy_0.4-22
[57] foreign_0.8-69 bit_1.1-12 mclust_5.4.1 SDMTools_1.1-221
[61] Formula_1.2-2 tsne_0.1-3 htmlwidgets_1.3 httr_1.3.1
[65] gplots_3.0.1 RColorBrewer_1.1-2 fpc_2.1-11.1 acepack_1.4.1
[69] modeltools_0.2-22 ica_1.0-2 pkgconfig_2.0.1 XML_3.98-1.10
[73] R.methodsS3_1.7.1 flexmix_2.3-14 nnet_7.3-12 locfit_1.5-9.1
[77] labeling_0.3 tidyselect_0.2.5 rlang_0.3.0.1 reshape2_1.4.3
[81] later_0.7.4 AnnotationDbi_1.40.0 munsell_0.4.3 tools_3.4.4
[85] RSQLite_2.1.0 ggridges_0.5.1 evaluate_0.10.1 stringr_1.3.0
[89] yaml_2.1.18 npsurv_0.4-0 knitr_1.20 bit64_0.9-7
[93] fitdistrplus_1.0-11 robustbase_0.93-1.1 caTools_1.17.1.1 purrr_0.2.5
[97] RANN_2.6 nlme_3.1-137 pbapply_1.3-4 mime_0.5
[101] R.oo_1.22.0 xml2_1.2.0 hdf5r_1.0.1 compiler_3.4.4
[105] rstudioapi_0.7 curl_3.2 png_0.1-7 beeswarm_0.2.3
[109] lsei_1.2-0 tibble_1.4.2 stringi_1.1.7 lattice_0.20-35
[113] trimcluster_0.1-2.1 pillar_1.2.1 lmtest_0.9-36 Rdpack_0.10-1
[117] irlba_2.3.2 data.table_1.10.4-3 bitops_1.0-6 gbRd_0.4-11
[121] httpuv_1.4.5 R6_2.2.2 latticeExtra_0.6-28 promises_1.0.1
[125] KernSmooth_2.23-15 gridExtra_2.3 vipor_0.4.5 codetools_0.2-15
[129] gtools_3.8.1 MASS_7.3-49 assertthat_0.2.0 rhdf5_2.22.0
[133] rprojroot_1.3-2 rjson_0.2.20 withr_2.1.2 GenomeInfoDbData_1.0.0
[137] diptest_0.75-7 doSNOW_1.0.16 hms_0.4.2 grid_3.4.4
[141] rpart_4.1-13 tidyr_0.8.2 class_7.3-14 rmarkdown_1.10
[145] segmented_0.5-3.0 Rtsne_0.13 shiny_1.1.0 base64enc_0.1-3
[149] ggbeeswarm_0.6.0